 data reduction


A Consolidated Cross-Validation Algorithm for Support Vector Machines via Data Reduction

Neural Information Processing Systems

We propose a consolidated cross-validation (CV) algorithm for training and tuning support vector machines (SVMs) on reproducing kernel Hilbert spaces. Our consolidated CV algorithm utilizes a recently proposed exact leave-one-out formula for the SVM and accelerates the SVM computation via a data reduction strategy. In addition, to compute the SVM with the bias term (intercept), which is not handled by the existing data reduction methods, we propose a novel two-stage consolidated CV algorithm. With numerical studies, we demonstrate that our algorithm is about an order of magnitude faster than the two mainstream SVM solvers, kernlab and LIBSVM, with almost the same accuracy.


Edge-Based Predictive Data Reduction for Smart Agriculture: A Lightweight Approach to Efficient IoT Communication

Krekovic, Dora, Kusek, Mario, Zarko, Ivana Podnar, Le-Phuoc, Danh

arXiv.org Artificial Intelligence

The rapid growth of IoT devices has led to an enormous amount of sensor data that requires transmission to cloud servers for processing, resulting in excessive network congestion, increased latency and high energy consumption. This is particularly problematic in resource-constrained and remote environments where bandwidth is limited, and battery-dependent devices further emphasize the problem. Moreover, in domains such as agriculture, consecutive sensor readings often have minimal variation, making continuous data transmission inefficient and unnecessarily resource intensive. To overcome these challenges, we propose an analytical prediction algorithm designed for edge computing environments and validated through simulation. The proposed solution utilizes a predictive filter at the network edge that forecasts the next sensor data point and triggers data transmission only when the deviation from the predicted value exceeds a predefined tolerance. A complementary cloud-based model ensures data integrity and overall system consistency. This dual-model strategy effectively reduces communication overhead and demonstrates potential for improving energy efficiency by minimizing redundant transmissions. In addition to reducing communication load, our approach leverages both in situ and satellite observations from the same locations to enhance model robustness. It also supports cross-site generalization, enabling models trained in one region to be effectively deployed elsewhere without retraining. This makes our solution highly scalable, energy-aware, and well-suited for optimizing sensor data transmission in remote and bandwidth-constrained IoT environments.
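
The transmit-on-deviation idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the predictor here is a hypothetical "last transmitted value" rule standing in for the paper's analytical prediction model, and the function names and `tolerance` parameter are assumptions made for the example. The cloud side mirrors the same predictor so both ends agree on every value that was not sent.

```python
from typing import List, Optional, Sequence, Tuple


def edge_filter(readings: Sequence[float], tolerance: float) -> List[Tuple[int, float]]:
    """Edge side: return the (index, value) pairs that would be transmitted.

    Assumption: the predicted next value is simply the last transmitted value.
    A reading is sent only when it deviates from that prediction by more
    than `tolerance`.
    """
    transmitted: List[Tuple[int, float]] = []
    predicted: Optional[float] = None
    for i, value in enumerate(readings):
        if predicted is None or abs(value - predicted) > tolerance:
            transmitted.append((i, value))
            predicted = value  # both sides resynchronise on the sent value
    return transmitted


def cloud_reconstruct(transmitted: Sequence[Tuple[int, float]], n: int) -> List[Optional[float]]:
    """Cloud side: fill every missing index with the same predictor's output."""
    received = dict(transmitted)
    series: List[Optional[float]] = []
    current: Optional[float] = None
    for i in range(n):
        if i in received:
            current = received[i]
        series.append(current)
    return series


readings = [20.0, 20.1, 20.2, 21.0, 21.1, 19.0]
sent = edge_filter(readings, tolerance=0.5)
rebuilt = cloud_reconstruct(sent, len(readings))
```

With this predictor, every reconstructed value differs from the true reading by at most `tolerance`, which is the trade-off such a scheme makes between accuracy and the number of transmissions.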


ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Zhang, Hengrui, Hui, Yulong, Liu, Yihao, Zhang, Huanchen

arXiv.org Artificial Intelligence

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but given enormous document collections and ad-hoc queries, their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
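
The cascade described in this abstract (a cheap proxy decides the confident cases, and only ambiguous ones reach the LLM) can be sketched roughly as below. This is an assumption-laden illustration rather than ScaleDoc's implementation: the fixed `low` and `high` thresholds are hypothetical stand-ins for the adaptive policy the paper derives from accuracy targets, and `oracle` is a placeholder callable in place of a real LLM call.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_filter(
    scores: Sequence[float],
    low: float,
    high: float,
    oracle: Callable[[int], bool],
) -> Tuple[List[bool], int]:
    """Decide a predicate per document using proxy scores plus an expensive oracle.

    scores: proxy-model scores in [0, 1], one per document (hypothetical input).
    low/high: hand-picked thresholds (assumption, not the paper's adaptive policy).
    oracle: expensive fallback, called only for scores strictly between low and high.
    Returns the decisions and the number of oracle calls made.
    """
    decisions: List[bool] = []
    oracle_calls = 0
    for doc_id, score in enumerate(scores):
        if score >= high:
            decisions.append(True)      # proxy is confident: accept
        elif score <= low:
            decisions.append(False)     # proxy is confident: reject
        else:
            oracle_calls += 1           # ambiguous: defer to the oracle
            decisions.append(oracle(doc_id))
    return decisions, oracle_calls


# Toy oracle for demonstration only: "accepts" even-numbered documents.
decisions, calls = cascade_filter(
    [0.95, 0.05, 0.5, 0.8, 0.2], low=0.3, high=0.7, oracle=lambda i: i % 2 == 0
)
```

Widening the gap between `low` and `high` sends more documents to the oracle and raises cost, while narrowing it trusts the proxy more; choosing that gap to meet an accuracy target is the part the abstract attributes to the adaptive mechanism.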


What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems

Leysen, Jens, Favier, Marco, Goethals, Bart

arXiv.org Artificial Intelligence

Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data "necessity" difficult to implement.




Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Chen, Fei, Zhou, Wenchi

arXiv.org Artificial Intelligence

In order to increase the effectiveness of model training, data reduction is essential to data-centric Artificial Intelligence (AI). It achieves this by locating the most instructive examples in massive datasets. To increase data quality and training efficiency, the main difficulty is choosing the best examples rather than the complete datasets. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our findings imply that training a classifier on the chosen optimal subset may improve model performance and increase training efficiency when combined with an efficient data reduction strategy. Furthermore, we have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models, yielding insightful results for faster training and cross-lingual data reduction.
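
For readers unfamiliar with PVI, the quantity is commonly defined as PVI(x → y) = −log2 g'[∅](y) + log2 g[x](y), where g is a model fine-tuned with the input and g' is the same architecture fine-tuned on a null input. The sketch below assumes the two probabilities are already available and shows one plausible reading of the static pruning step, in which higher PVI is treated as lower difficulty. The function names and the pruning fraction are illustrative assumptions, not the authors' code.

```python
import math
from typing import List, Sequence


def pvi(p_with_input: float, p_null: float) -> float:
    """Pointwise V-information in bits.

    p_with_input: probability of the gold label from the model that sees x.
    p_null: probability of the gold label from the model trained on null input.
    """
    return -math.log2(p_null) + math.log2(p_with_input)


def drop_easiest(pvi_scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices kept after removing the `fraction` of highest-PVI instances.

    Assumption: high PVI means the instance is easy for the model, so
    "removing instances with low difficulty" is read as removing high PVI.
    """
    n_drop = int(len(pvi_scores) * fraction)
    by_easiness = sorted(range(len(pvi_scores)), key=lambda i: pvi_scores[i], reverse=True)
    dropped = set(by_easiness[:n_drop])
    return [i for i in range(len(pvi_scores)) if i not in dropped]


easy = pvi(p_with_input=0.5, p_null=0.25)   # the input helps: positive PVI
hard = pvi(p_with_input=0.25, p_null=0.5)   # the input hurts: negative PVI
kept = drop_easiest([1.0, -1.0, 0.5, 2.0], fraction=0.25)
```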


Scale Efficient Training for Large Datasets

Zhou, Qing, Gao, Junyu, Wang, Qi

arXiv.org Artificial Intelligence

The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on real datasets of various scales across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
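
The three steps named in this abstract (random pruning, grouping by loss, and a sliding window that moves from easy to hard) can be sketched as below. This is a simplified illustration and not the code from the linked repository: equal-size binning of loss-sorted samples stands in for the paper's clustering, and the one-group-per-epoch window schedule, the function name, and all parameters are assumptions.

```python
import math
import random
from typing import List, Sequence


def select_samples(
    losses: Sequence[float],
    keep_ratio: float,
    n_groups: int,
    window: int,
    epoch: int,
    seed: int = 0,
) -> List[int]:
    """Pick sample indices to train on for one epoch."""
    rng = random.Random(seed)
    ids = list(range(len(losses)))

    # Step 1: random pruning to thin out redundant samples.
    kept = rng.sample(ids, int(len(ids) * keep_ratio))

    # Step 2: group by difficulty. Sorting by loss and splitting into
    # equal-size bins is a stand-in for clustering on the loss value.
    kept.sort(key=lambda i: losses[i])
    size = max(1, math.ceil(len(kept) / n_groups))
    groups = [kept[k:k + size] for k in range(0, len(kept), size)]

    # Step 3: sliding window over the groups, shifted one group per epoch
    # from easy to hard and clamped at the hardest end (hypothetical schedule).
    start = min(epoch, max(len(groups) - window, 0))
    return [i for group in groups[start:start + window] for i in group]


losses = [0.1 * i for i in range(10)]
early = select_samples(losses, keep_ratio=1.0, n_groups=5, window=2, epoch=0)
late = select_samples(losses, keep_ratio=1.0, n_groups=5, window=2, epoch=3)
pruned = select_samples(losses, keep_ratio=0.5, n_groups=5, window=2, epoch=0)
```

In a real training loop the losses would be refreshed as the model learns, so group membership changes between epochs; the static `losses` list here only keeps the example self-contained.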


SoK: Knowledge is All You Need: Last Mile Delivery for Automated Provenance-based Intrusion Detection with LLMs

Cheng, Wenrui, Zhu, Tiantian, Xiong, Chunlin, Sun, Haofei, Wang, Zijun, Jing, Shunan, Lv, Mingqi, Chen, Yan

arXiv.org Artificial Intelligence

Recently, provenance-based intrusion detection systems (PIDSes) have been widely proposed for endpoint threat analysis. However, due to the lack of systematic integration and utilization of knowledge, existing PIDSes still require significant manual intervention for practical deployment, making full automation challenging. This paper presents a disruptive innovation by categorizing PIDSes according to the types of knowledge they utilize. In response to the prevalent "knowledge silos" problem in existing research, we introduce a novel knowledge-driven provenance-based intrusion detection framework, powered by large language models (LLMs). We also present OmniSec, a best practice system built upon this framework. By integrating attack representation knowledge, threat intelligence knowledge, and benign behavior knowledge, OmniSec outperforms the state-of-the-art approaches on public benchmark datasets. OmniSec is available online at https://anonymous.4open.science/r/PIDS-with-LLM-613B.


The Clear Sky Corridor: Insights Towards Aerosol Formation in Exoplanets Using An AI-based Survey of Exoplanet Atmospheres

Ashtari, Reza, Stevenson, Kevin B., Sing, David, Lopez-Morales, Mercedes, Alam, Munazza K., Nikolov, Nikolay K., Evans-Soma, Thomas M.

arXiv.org Artificial Intelligence

Producing optimized and accurate transmission spectra of exoplanets from telescope data has traditionally been a manual and labor-intensive procedure. Here we present the results of the first attempt to improve and standardize this procedure using artificial intelligence (AI) based processing of light curves and spectroscopic data from transiting exoplanets observed with the Hubble Space Telescope's (HST) Wide Field Camera 3 (WFC3) instrument. We implement an AI-based parameter optimizer that autonomously operates the Eureka pipeline to produce homogeneous transmission spectra of publicly available HST WFC3 datasets, spanning exoplanet types from hot Jupiters to sub-Neptunes. Surveying 42 exoplanets with temperatures between 280 and 2580 Kelvin, we confirm modeled relationships between the amplitude of the water band at 1.4 μm in hot Jupiters and their equilibrium temperatures. We also identify a similar, novel trend in Neptune/sub-Neptune atmospheres, but shifted to cooler temperatures. Excitingly, a planet mass versus equilibrium temperature diagram reveals a "Clear Sky Corridor," where planets between 700 and 1700 Kelvin (depending on the mass) show stronger 1.4 μm H2O band measurements. This novel trend points to metallicity as a potentially important driver of aerosol formation. As we unveil and include these new discoveries into our understanding of aerosol formation, we enter a thrilling future for the study of exoplanet atmospheres. With HST sculpting this foundational understanding for aerosol formation in various exoplanet types, ranging from Jupiters to sub-Neptunes, we present a compelling platform for the James Webb Space Telescope (JWST) to discover similar atmospheric trends for more planets across a broader wavelength range.